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Summary 

We introduce a new class of Sequential Monte Carlo (SMC) methods, which we 
call free energy SMC. This class is inspired by free energy methods, which origi- 
nate from Physics, and where one samples from a biased distribution such that a 
given function £^(9) of the state 9 is forced to be uniformly distributed over a given 
interval. From an initial sequence of distributions (ivt) of interest, and a partic- 
ular choice of ^(6), a free energy SMC sampler computes sequentially a sequence 
of biased distributions {it) with the following properties: (a) the marginal dis- 
tribution of i^{d) with respect to it is approximatively uniform over a specified 
interval, and (b) it and irt have the same conditional distribution with respect 
to ^. We apply our methodology to mixture posterior distributions, which are 
highly multimodal. In the mixture context, forcing certain hyper-parameters to 
higher values greatly faciliates mode swapping, and makes it possible to recover 
a symetric output. We illustrate our approach with univariate and bivariate 
Gaussian mixtures and two real-world datasets. 

Keywords and Phrases: Free energy biasing; Label switching; Mixture; 
Sequential Monte Carlo: particle filter. 



1. INTRODUCTION 

A Sequential Monte Carlo (SMC) algorithm (a.k.a. particle filter) samples iteratively 
a sequence of probability distributions (7rt)t=o,...,T, through importance sampling and 
resampling steps. The initial motivation of SMC was the sequential analysis of dy- 
namic state space models, where vrt stands for the filtering distribution of state (la- 
tent var iable) xt, con d itiona l on the data y-i.t collected up to time t\ see e. g . the 
book of [ Doucet et alJ l|200ll ). Recent research however l|Neall . 1200 ll : IChopinl. l2002l : 
iDel MoraTeTliLl . 120061 ) have extended SMC to "static" problems, which involves a 
single, but "difficult" (in some sense we detail below) distribution tt. Such extensions 

use an artificial sequence (7rt)t=o t, starting at some "simple" distribution ttq, and 

evol ving smoot hly tow ards ttt = tt. Two instances of such strategies are i) anneal- 
ing (|Nealll2001l . see also I GeIman"and~M cng 1998), where TTt{e) = 7ro(6l)^"^*7r(6l)^S and 
jt = t/T , or so me other increasing sequence that starts at and ends at 1; and ii) IBIS 
(|Chopinl . 120021 ) ■ where n stands for some Bayesian posterior density tv{9) = p(S|i/i:t), 
conditional on some complete dataset yi-.T, and Trt{d) = p{6\yi:t)- For a general for- 
malism for SMC, see IDel Moral et all (2006[ ). 

One typical "difficulty" with distributions of interest tt is multimodality. A vanilla 
sampler typically converges to a single modal region, and fails to detect other modes, 
which may be of higher density. The two SMC strategies mentioned above alleviate 
this problem to some extent. In both cases, ttq is usually unimodal and has a large 
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support, so "particles" (sampled points) explore the sample space freely during the 
first iterations. However, this initial exploration is not always sufficient to prevent the 
sample to degenerate to a single modal region. We give an illustration of this point in 
this paper. 

To overcome multimodality, the molecular dynamics community has developed 
in recent years an interesting class of method s, based on the concept of free energy 
biasing; see for instance the book of Lelievre et al. (20f0) for a general introduction. 
Such methods assume the knowledge of a low-dimensional function (,{0), termed as the 
"reaction coordinate", such that, conditional on ^(9) = x, the multimodality (a.k.a. 
mestability in the physics literature) of tt is much less severe, at least for certain 
values of x. The principle is then to sample from tt, a free energy biased version of 
TT, Tr{9) = tt{9) exp {A o ^{9)}, where A denotes the free energy, that is, minus log the 
marginal density of the random variable ^{9), with respect to tt. This forces an uniform 
exploration of the random variable ^(6^), within given bounds. At a final stage, one 
may perform importance sampling from vf to tt to recover the true distribution tt. 

The main difficulty in free energy biasing methods is to estimate the free energy A. 
A typical approach is to compute sequentially an estimate A^f) of A, using some form 
of Adaptive MCMC (Markov chain Monte Carlo): at each iteration t, a MCMC step is 
performed, which leaves invariant 7r(t)(0) — 'k{9) exp {Ajfj o ^(0)}, then a new estimate 
A{^t+i) of the free energy is computed from the simulated process up to time t. The 
simulation is stopped when the estimate A(t) stabilises in some sense. Convergence of 
Adaptive MCMC samplers is a delicate subject: trying to learn too quickly from the 
past may prevent convergence for instance. These considerations are outside the scope 
of this paper, and we refer the interested reader to the review bv lAndrieu and Thomsl 
(poos) and references therein. 

Instead, our objective is to bring the concept of free energy biasing to the realm of 
SMC. Specifically, and starting from some pre-specified sequence Ijit), we design a class 
of SMC samplers, which compute sequentially the free energy At associated to each dis- 
tribution TTt, and track the sequence of biased densities 7rt(S) = 7rt(^) exp {At o ^(S)}. 
In this way, particles may move freely between the modal regions not only at the early 
iterations, where yrt remains close to ttq and therefore is not strongly multimodal, but 
also at the later stages, thanks to free energy biasing. 

We apply free energy SMC sampling to the Bayesian analysis of mixture models. 
IChopin et al.l 12010) show that free energy biasing methods are an interesting approach 
for dealing with the multimodality of mixture posterior distributions. In particular, 
they present several efficient reaction coordinates for univariate Gaussian mixtures, 
such as the hyper-parameter that determines the prior expectation of the component 
variances. In this paper, we investigate how free energy SMC compares with this 
initial approach based on Adaptive MCMC, and to which extent such ideas may be 
extended to other mixture models, such as a bivariate Gaussian mixture model. 

The paper is organised as follows. Section [2] describes the SMC methodology. 
Section [3] presents the concept of free energy biased sampling. Section U presents 
a new class of SMC methods, termed as free energy SMC. Section [5] discusses the 
application to Bayesian inference for mixtures, and presents numerical results, for two 
types of mixtures (univariate Gaussian, bivariate Gaussian), and two datasets. Section 
[5] concludes. 



2. SMC ALGORITHMS 
2.1. Basic structure 

In this section, we describe briefly the structure of SMC algorithms. For the sake of 
exposition, we consider a sequence of probability densities nt, t = Q, . . . ,T defined on 
a common space Q C R''. At each iteration t, a SMC algorithm produces a weighted 
sample {wt,n, 9t,n), n = 1, . . . , N , which targets nt in the following sense: 

E^=i wt.nV?(6't.n) r^„(a\\ 
— s^iv^+oo IB. W{9)^, 

almost surely, for a certain class of test functions (p. At iteration 0, one typically 
samples 9o,n ~ ttq, and set wo,n = 1. To progress from iteration t — 1 to iteration t, it 
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is sufficient to perform a basic importance sampling step from nt~i to nt: 

9t,n = dt-l,n, Wt,n = Wt-l,n X Ut{6t,n) 

where ut denotes tlie incremental weigfit function 

MO) 



ut{e) = 



However, if only importance sampling steps are performed, the algorithm is equivalent 
to a single importance sampling step, from ttq to ttt- This is likely to be very inefficient . 
Instead, one should perform regularly resample-move steps (|Gilks and Berzuinil. [2OOII ') . 
that, is, a succession of i) a resampling step, where current points Ot,n are resampled 
according to their weights, so that points with a small (resp. big) weight are likely to 
die (resp. generate many offsprings); and ii) a mutation step, where each resampled 
point is "mutated" according to some probability kernel Kt(6, d9), typically a MCMC 
kernel with invariant distribution vrt. In the more general formalism of Del Moral et af] 
(|2006l '). this is equivalent to performing an importance sampling step in the space 0X0, 
with forward kernel Kt, associated to some probability density Kt{9, 9), and backward 
kernel Lt associated to the probability density Lt{9,9) = ■Kt{9)Kt{9,9)/nt{9). 

Resample-move steps should be performed whenever the weight degeneracy is too 
high. A popular criterion is EF(t) < r, where r g (0, 1), and EF is the efficency factor, 
that is the effective sample size of lKong et aD \im4 ) divided by iV, 

EF(f) - ^ ^ 



NT.:-. 



We summarise in Algorithm 1 the general structure of SMC algorith ms. There are 
sever al methods for resampli ng the particles, e.g. t he multinomial scheme (iGordon et al.l . 
1993) . the residual scheme (|Liu and Chenl . 1 199^ . the systematic scheme d WhitlevL 



1994 : (Carpenter erall . Il999l ). We shall use the systematic scheme in our simulations. 



2.2. AdapUveness of SMC 

In contrast to MCMC, where designing adaptive algorithms require a careful conver- 
gence study, designing adaptive SMC samplers is s traightfo r ward. We consider first 
the design of the MCMC kernels Kt- For instance. IChopiiil l|2002f l uses independent 
Hastings-Metropolis kernels, with a Gaussian proposal fitted to the current particle 
sample. This is a reasonable strategy if ttj is close to Gaussianity. In this paper, we 
consider instead the following strategy, which seems more generally applicable: take Kt 
as a succession of k Gaussian random walk Hastings-Metropolis steps Kt,i{9,d9'), i.e. 
simulating from Kt.i{9,d9') consists of proposing a value 9' ~ Nd{9,T,t,i), accepting 
this value with probability 1 A {71(9') /n{9)}, otherwise keep the current value 9. Then 
take St,i — Ct,iSt, ct,i > 0, and St is the empirical covariance matrix of the resampled 
particles at iteration t (that is, the particles obtained immediately before the MCMC 
step with kernel Kt is performed). The constant Ct,i may be tuned automatically as 
well. For instance, one may start with co = 0.3, then, each time the acceptance rate 
of the MCMC step is below (resp. above) a given threshold, the constant ct is divided 
(resp. multiplied) by two. 

As in MCMC, it is common to focus on the adaptiveness of the transition kernels, 
but one may use the particle sample (or the history of the process in the MCMC 
context) to adapt the target distributions as well. This is precisely what we do in this 
paper, since the target at time t on our free energy SMC sampler shall depend on a 
bias function which is estimated from the current particle sample, see Section U) 

2.3. IBIS versus annealing, choice of ttq 
When the distribution of interest n is some Bayesian posterior density 



7t{9) ^ p{9\yi:n) = ^p{9)piyi:D\9), 
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Algorithm 1 A generic SMC algoritlim 



0. Sample 6*0, n ~ ttq, set wq „ = 1, for n = 1, . . . , A. Set t = 1. 

1. Compute new weights as 

Wt,n = Wt-l,n X Ut{9t-l,n)- 

2. If EF(i) < r, then 

(a) resample the particles, i.e. construct a sample {Ot.n)i<n<N made of i?t_„ 
replicates of particle 9t,n, I < n < N, where Rt,n is a nonnegative integer- 
valued random variable such that 

E [Rt,n] = — ]v — ) 

L„' = l Wt,n' 

and set Wf^n = 1. 

(b) move the particles with respect to Markov kernel Kt, 
otherwise 

9t,n = 0t~l,n- 

3. t ^t + l,iit <T goto Step 1. 



where yi-.o is a vector of D observations, p{0) is the prior density, and p(yi:D\d) is the 
likelihood, it is of interest to compare the two aforementioned SMC strategy, namely, 

1. IBIS, where T = D, and 7rt(S) — p{9\yi;t), in particular, ■ko{9) = p{9) is the 
prior; and 

2. Anneahng, where nt{0) = no{6y '*-k{6)''* , 7t is an increasing sequence such that 
7o = 0, and 7t = 1, ttq is typically the prior density, but could be something 
else, and T and D do not need to be related. 

Clearly, for the same number of particles, and assuming that the same number of 
resample-move steps is performed, IBIS is less time-consuming, because calculations 
at iteration t involve only the t first observations. On the other hand, annealing may 
produ ce a smoother seq uence of distributions, so it may require less resample-move 
steps. iJasra et al.l (|2007l ) provide numerical examples where the IBIS strategy leads to 
unstable estimates. In the context discussed in the paper, see Section[S] and elsewhere, 
we did not run into cases where IBIS is particularly unstable. Perhaps it is fair to say 
that a general comparison is not meaningful, as the performance of both strategies 
seems quite dependent on the applications, and also various tuning parameters such 
as the sequence 7t for instance. 

We take this opportunity however to propose a simple method to improve the regu- 
larity of the IBIS sequence, in the specific case where the observations are exchangeable 
and real- valued. We remark first that this regularity depends strongly on the order 
of incorporation of the yt's. For instance, sorting the observations in ascending order 
would certainly lead to very poor performance. O n the ot h er ha nd, a random order 
would be more suitable, and was recommended bv IChoplnl (|2002l ). Pushing this idea 
further, we propose the following strategy: First, we re-define the median of a sample 
as either the usual median, when D is an odd number, or the smallest of the two 
middle values in the ordered sample, when D is an even number. Then, we take yi as 
the median observation, y2 (resp. ys) to be the median of the observations that are 
smaller (resp. larger) than j/i, then we split again the four corresponding sub-samples 
by selecting some values 2/4 to yr, and so on, until all values are selected. We term 
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this strategy as "Van der Corput ordering", as a Van der Corput binary sequence is 
precisely defined as 1/2, 1/4, 3/4, 1/8, .. . 

A point which is often overlooked in the literature, and which affects both strate- 
gies, is the choice of ttq. Clearly, if 7ro(^) = p{d), one may take the prior so uninfor- 
mative that the algorithm degenerates in one step. Fortunately, in the application we 
discuss in this paper, namely Bayesian analysis of mixture models, priors are often 
informative; see Section [S] for a discussion of this point. In other contexts, it may 
be helpful to perform a preliminary exploration of vr in order design some ttq, quite 
possibly different from the prior, so that (i) for the annealing strategy, moving from 
TTo to ttt ~ TT does not take too much time; and (ii) for the IBIS strategy, one can 
use TTo as an artificial prior, and recover the prior of interest at the final stage of the 
algorithm, by multiplying all the particle weights by p{0)/no{9)- 

3. FREE ENERGY-BIASED SAMPLING 
3.1. Definition of free energy, and free- energy btased densities 

In this section we explain in more detail the concept of free energy biasing. We consider 
a single distribution of interest, defined by a probability density vr with respect to the 
Lebesgue measure associated to C R'^. As explained in the introduction, the first 
step in implementing a free energy biasing method is to choose a reaction coordinate, 
that is, some measurable function ^ : S — >■ R'' , where d' is small. In this paper, we take 
d' = 1. One assumes that the multimodality of vr is strongly determined, in some sense, 
by the direction (,{9). For instance, the distribution of 6, conditional on ^{6) — x, may 
be much less multimodal than the complete distribution tt, for either all or certain 
values of x. 

In words, the free energy is, up to an arbitrary constant, minus the logarithm of 
the marginal density of £,{6). The free energy may be written down informally as 

exp{-A(x)}oc / 7r(e)I[,,,+rf,j {^(0)} d6» 
Je 

and more rigorously, as 

exp{-A(a;)}oc / 11(6) d{e\£,{e) = x} , 

where fi^ = G : ^(6) = x}, and d{9\^{9) = x} denotes the conditional measure 
on the set Q,x which is "compatible" with Lebesgue measure on the embedding space O, 
i.e. volumes are preserved and so on. In both formulations, the proportionality relation 
indicates that the density tt may be known only up to a multiplicative constant, and 
therefore that the free energy is defined only up to an arbitrary additive constant. 
The free energy biased density tt is usually defined as 

n{e) (X 7v{e) exp {A o ^9)} I[.„,.„,.„..] im) 

where [aimin, 3;max] is some pre-defined range. It is clear that, with respect to tt, the 
marginal distribution of the random variable ^(^) is uniform over [a;min, aimax], and the 
conditional distributions of 9, given £^(9) = x matches the same conditional distribution 
corresponding to n. The objective is to sample from n, which requires to estimate the 
free energy A. 

To avoid the truncation incurred by the restriction to interval [a;min, a^max], we shall 
consider instead the following version of the free-energy biased density 7ft: 

iv{9) oc 7r(6l)exp{Ao^(6l)} 

where the definition of A is extended as follows: A{x) = A{xniin) for x < Xmin, 

A{x) = ^(Xmax) for X > 
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3.2. Estimation of the free energy 

As explained in the introduction, one usually resorts to some form of Adaptive MCMC 
to estimate the free energy A. Specifically, one performs successive MCMC steps 
(typically Hastings-Metropolis), such that the Markov kernel i('(t) used at iteration 
t depends on the trajectory of the simulated process up to time t — 1. (The simu- 
lated process is therefore non-Markovian.) The invariant distribution of kernel K^tf is 
'K^t){S) oc 7r(0) exp { °^(^)}, where is an estimate of the free energy A that 
has been computed at iteration t, from the simulated trajectory up to time t — 1. Note 
that the brackets in the notations K^t) , TTjt) , A^ f) indicate that all these quantities are 
specific to this section and to the Adaptive MCMC context, and must not mistaken 
for the similar quantities found elsewhere in the paper, such as, e.g. the density nt 
targeted at iteration t by a SMC sampler. The difficulty is then to come up with an 
efficient estimator (or rather a sequence of estimators, j4(t)), of the free energy. 

Since this paper is not concerned with adaptive MCMC, we consider instead 
the much simpler problem of estimating the free energy A from a weighted sample 
{9n,Wn)n^i,...,N targeting tt; for instance, the 9„'s could be i.i.d. with probability 
density g, and Wn = n{6)/g{6). Of course, this discussion is simplistic from an Adap- 
tive MCMC perspective, but it w ill be sufficient in our SMC context. We refer the 
reader to e.g. IChopin et al.l l|2010l ') for the missing details. 

First, it is necessary to discretise the problem, and consider some partition: 



1 

[^min , ^max] — '^i — ["^i ? -^i+l] 5 "^i — ^min ~t~ (^max ^min) . (1) 

rix 

Then, they are basically two ways to estimate A. The first method is to estimate 
directly a discretised version of A, by simply computing an estimate the proportion of 
points that fall in each bin: 



{^^(x)} = ^-^""^gf^"'^""'"^^^ for. 



EN 
n = l ' 

The second method is indirect, and based on the following property: the derivative of 
the free energy is such that 

A'ix) = K^ [fmm = x] 

where the force / is defined as: 

^_ (Vlog7r).(Vg) ^. J 



and V (resp. div) is the gradient (resp. divergence) operator. Often, ^{6) is simply a 
coordinate of the vector 9,9 = (^, ...), in which case the expression above simplifies to 
/ = — 91og7r/9^. This leads to the following estimator of the derivative of A: 

i^(.)^^-V-"^^^(^")^["'-'^-'>^(^"), for.e[..,...rl. 

Then an estimate of A may be deduced by simply computing cumulative sums for 
instance: 

j:xj <x 

Methods based on the first type of estimates are usually called ABP (Adaptive 
Biasing Potential) methods, while methods of the second type are called ABF (Adap- 
tive Biasing Force). Empirical evidence suggests that ABF leads to slightly smoother 
estimates, presumably because it is based on a derivative. 
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4. FREE ENERGY SMC 

We now return to the SMC context, and consider a pre-specified sequence (nt). Our 
objective is to derive a SMC algorithm which sequentially compute the free energy At 
associated to each density vrt, 

exp{~At{x)} <x J nt{e)d{e\^{e) ^x} 

and sample TTt, the free energy biased version of tti, 

nt{e)xTTt(e)exp{Ato^{e)}. 

Again, to avoid truncating to interval [ajmin, Xmax], one extends the definition of At 
outside [a;min, a^max] by taking At{x) = At(xinin) for x < Xmin, At{x) = ylt(3;max) for 
X > 

As explained in Section [3.21 one actually estimates a discretised version of the free 
energy, i.e., the algorithm shall provide estimates At{xi), i — 0,...,nx of the free 
energy evaluated at grid points over an interval [a;min, Smax], as defined in ([T|). Note 
that this grid is the same for all iterations t. 

Assume that we are at the end of iteration t — 1, that estimates At-i{xi) of 
At-i have been obtained, and that the particle system {9t — i,n^^t—i.n^n—i,...,N targets 
TTt-i- If the particles are re-weighted according to the incremental weight function 
ut(e) = nt{e)/nt-i{e), i.e. 

Wt,n = Wt-l.n X Ut{6t-l,n) 

then the new target distribution of the particle system {6t-i,n,wt,n)n=i,...,N is 

TTtie) xnt-i{e)utie). 

The objective is then to recover nt, which depends on the currently unknown free 
energy At- To that effect, we first state the following result. 

Theorem 1 The free energy Dt associated to Tft is 

Dt=At- At-i 
that is, the difference between the free energies of nt and nt-i. 
Proof. One has, for 9 S 0, 

7rt(e) 0C7rt(6l)exp{At_i o5(6l)} 

hence, for x G 5(6) > 

/ S-t(9)d{e|5(e) = x} =exp{(At_i -AOW}. 
and one concludes. r-i 



This result provides the justification for the following strategy. First, particles are 
reweighted from nt-i to 7ft, as explained above. Second, the free energy Dt of vft is 
estimated, using either the ABP or the ABF strategy, see Section [T2l this leads to some 
estimate Dt of Dt, ot more precisely estimates Dt{xi) over the grid xq, ■ ■ ■ , Xn^. From 
this, one readily obtains estimates of the current free energy, using the proposition 
above: 

At{xi) = At-i{xi) + Dt(xi), i^O,...,n^. (2) 

Third, one recovers TTt by performing an importance sampling step from 7ft to 7ft; this 
is equivalent to updating the weights as follows: 

■Wt,n = Wt,n exp I A O ^(6lt,n)| . 

An outline of this free energy SMC algorithm is given in Algorithm 2. 
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Algorithm 2 Free energy SMC 



0. Sample 0o,n ~ ttq, set wo.n = 1, for = 1, . . . , N. Compute Aq and set t = 1. 

1. Compute new weights as 

lVt,n = Wt-1,„ X UtiOt-l.n)- 

2. Compute an estimator IDt of free energy Dt, compute weights 



Wt,n = Wt,n exp 



and update the estimate At of the free energy At, using ([2]). 
3. If EF(i) < T, then 

(a) resample the particles, i.e. draw randomly Ot.n in such a way that 



E 



N 



t,n' — Ot-l.n 



{9t-l.n,Wt.n) 



Nwt,n 



and set wt.n = 1- 

(b) move the particles with respect to Markov kernel Kt, 



otherwise 
4. i ^ t + 1, if < < r go to Step 1 



't,n-Kt{et,n,de) 

Ot,n = dt-l,n- 
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At the final stage of the algorithm (iteration T), one recovers the unbiased target 
ttt = TT by a direct importance sampling step, from ttt to ttt'- 



TTT{e) 

oc exp 



This is because of this ultimate debiasing step, which relies on At, that one must store 
in memory and compute iteratively the "complete" free energy At (as opposed to the 
successive Dt, which may be termed as "incremental" free energies). If this unbiasing 
step is too "brutal", meaning that too many particles get a low weight in the final 
sample, then one may apply instead a progressive unbiasing strategy, by extending the 
sequence of distributions ttt as follows: 



iT+i{e) (X i^T{e) exp I At o e(e)| , ; = 0, 



and performing additional SMC steps, that is, successive importance sampling steps 
from -KT+i to TTT+i+i, and, when necessary, resample-move steps in order to avoid 
degeneracy. In our simulations, we found that progressive unbiasing did lead to some 
improvement, but that often direct unbiasing was sufficient. Hence, we report only 
results from direct unbiasing in the next Section. 

5. APPLICATION TO MIXTURES 

5.1. General formulation, multimodality 

A .fsT— component Bayesian mixture model consists of D independent and identically 
distributed observations j/i, with parametric density 

1 ^ 

piyi\0) ^ '^u}ki>iyi;^k), i^k>0, 

where {^'(■iOjC £ "} is some parametric family, e.g. ^(y,^) — N{y;^,l/\), ^ = 
(/i, A~^). The parameter vector contains 

6 = (a;i, . . . ,ajfc,5i, . . . ,Cfc,'?), 

where is the set of hyper-parameters that are shared by the K components. The 
prior distribution p{6) is typically symmetric with respect to component permutation. 
In particular, one may assume that, a priori and independently ujk ~ Gamma((5, 1). 
This leads to a Dirichlet_ft:(5, ■ ■ ■ ,5) prior for the component probabilities 

qk = — , k = l,...K. 

We note in passing that, while the formulation of a mixture model in terms of the q'f.s 
is more common, we find that the formulation in terms of the unnormalised weights 
oifc is both more tractable (because it imposes symmetry in the notations) and more 
convenient in terms of implementation (e.g. designing Hastings-Metropolis steps). 
An important feature of the corresponding posterior density 

D 

7T{e) = p{9\yi..D) X Pi9)l[p{y^\e), 

i = l 

assuming D observations are available, is its invariance with respect to "label per- 
mutation". This featur e and its bea r ings t o Monte Car l o infe r ence have received a 
lot of attention, see e.g. iCeleux et all (|200a '). lJasra et al.l (|2005l '). lChopin et all (|2010l ') 
among others. In short, a standard MCMC s ampler, such as the Gibbs sampler of 
iDiebolt and Robert! I\1994 ). see also the book of iFriihwirth-Schnatterl (|2006[ ). typically 
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visits a single modal region. But, since the posterior is symmetric, any mode admits 
K\ — 1 replicates in Q. The refore, one can assert that the sampler has not converged. 
iFriihwirth-Schnatteil (I2001D proposes to permute randomly the components at each 
iteration. However, iJeisra et al.l (|2005r ) mentions the risk of "genuine multimodality" , 
that is, the K\ symetric modal regions visited by the permutation sampler may still 
represent a small part of the posterior mass, because other sets of equivalent modes 
have not been visited. (Marin and Robert, 2007, Chap. 6) andX]hopin ct al] l|2010f ) 
provide practical examples of this phenomenon. 

One could say that random permut ations merely "cure t he most obvious s ymp- 
tom" of failed conv ergence. We follow ICeleux et all (|2000h . iJasra et all (|2005n and 
iChopin et al.l l|2010f ). and take the opposite perspective that one should aim at design- 
ing samplers that produce a nearly symmetric output (with respect to label switching), 
without resorting to random permutations. 

5.2. Univariate Gaussian mixtures 

5.2.1. Prior, reaction coordinates 

We first consider a univariate Gaussian mixt ure model, i.e. = N(y ; n, \~^), 

^ = (/i, A~^), and we use the same prior as in I Richardson and Greeiil (|l997f ). that is, 
for k = 1, . . . , K , independently, 

fj.k ~ N{A4, Afc ~ Gamma(a, /3), 

where a, M and k are fixed, and /3 is a hyper-parameter: 

P ~ Gamma((;, /i). 

Specifically, we take 5 = 1, a = 2 (see Chap. 6 of iFrtihwirth-SchnatterL l2006l for a 
justification), g = 0.2, h = lOOg/aR^ , M = y, and k = A/P?, where y and R are, 
respectively, the empirical mean and the range of the observed sample. 

Regarding the applicatio n of free energy me thods to univariate Gaussian mix- 
ture posterior distributions, IChopin et al.l (|2010l ) find that the two following func- 
tions of 9 are efficient reaction coordinates: ^(0) = /3, and the potential function 
V{Q) = —\og{p{6)p(yi:D\0)}, that is, up to a constant, minus log the posterior den- 
sity. However, the latter reaction coordinate is less convenient, because it is difficult 
to determine in advance the range [a;min, imax] of exploration. This is even more prob- 
lematic in our sequential context. Using the IBIS strategy for instance, one would 
define Vt{0) — — log {p{9)p{yi;t\0)} , but the range of likely values for Vt would typ- 
ically be very different between small and large values of t. Thus we discard this 
reaction coordinate. 

In constrast, as discussed already in IChopin et al.l (|2010l ). it is reasonably easy 
to determine a range of likely values for the reaction coordinate ^{0) = /?. In our 
simulations, we take [xmin, Xmaxi = [-R^/2000, i?^/20], where, again, R is the range 
of the data. IChopin et al.l (|201Clf ) explains the good performance of this particular 
reaction coordinate as follows. Large values of /3 penalise small component variances, 
thus forcing /3 to large values leads to a conditional posterior distribution which favours 
overlapping components, which may switch more easily. 

5.2.2. Numerical example 

We consider the most challenging exa mple discussed in IChopin et ahl ()2010l ). namely 
the Hidalgo stamps dataset, see e . g. [Izenman and Sommerl (|l988f ) for details, and 
K = 3. In particular, IChopin et al.l i|201ol ) needed about 10^ iterations of an Adaptive 
MCMC sampler (namely, an ABF sampler) to obtain a stable estimate of the free 
energy. 

We run SMC samplers with the following settings: the number of particles is 
N = 2 X 10'*, the criterion for triggering resample-move steps is ESS< O.SA'^, and a 
move step consists of 10 successive Gaussian random walk steps, using the automatic 
calibration strategy described in Section [2.21 

We first run a SMC sampler, without free energy biasing, and using the IBIS 
strategy. Results are reported in Figures [T] and [5] the output is not symmetric with 
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respect to label permutation, and only one modal region of the posterior distribution 
is visited. 
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Figure 1: Hexagon binning for (/ifc,logAfc), k — 1, 2, 3, for the standard SMC 
sampler, no free energy biasing, IBIS strategy. 




Figure 2: Weighted ID histograms for the standard SMC sampler, no free energy 
biasing, IBIS strategy. 



We then run a free energy SMC sampler, using the reaction coordinate ^, 50 bins, 
and the ABP strategy for estimating the free energies. Figures [3] and [4] represent the 
cloud of particles before the final unbiasing step, when the particles target the free 
energy biased density tvt- Figures [5] and [6] represent the cloud of particles at the final 
step, when the target is the true posterior distribution. One sees that the output 
is not perfectly symmetric (at least after the final debiasing step), but at least the 
three equivalent modes have been recovered, and one can force equal proportions for 
the particles in each modal region, by simply randomly permuting the labels of each 
particles, if need be. 
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Figure 3: Hexagon binning for (/Xfe,logAfc), fc = 1, 2, 3, for the free energy SMC 
sampler, before the final debiasing step, IBIS strategy. 
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Figure 4: Histograms of the components of the simulated particles obtained by 
free energy SMC sampler, before the final debiasing step, IBIS strategy. 




Figure 5: Hexagon binning for (/ife, log A^), fc = 1, 2, 3, for the free energy SMC 
sampler, after the final debiasing step, IBIS strategy. 
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Figure 6: Histograms of the components of the simulated particles obtained by 
free energy SMC sampler, after the final debiasing step, IBIS strategy. 

To assess the stability of our results, we run the same sampler ten times, and 
plot the ten so-obtained estimates of the overall free energy At, which is used in 
the last debiasing step; see Figure [7] Since a free energy function is defined only up 
to an additive function, we arbitrarily force the plotted functions to have the same 
minimum. 




Figure 7: Estimates of the final free energy At obtained from 10 runs of a free 
energy SMC sampler, versus cell indices 



In short, one sees in this challenging example that (a) a nearly symmetric output 
is obtained only if free energy biasing is implemented; and (b) using free energy SMC, 
satisfac tory results are obta ined at a smaller cost than the adaptive MCMC sampler 
used in lChoDin et~all (|2010l 'l. 

5.3. Bivariate Gaussian mixtures 
5.3.1. Prior, reaction coordinates 

We now consider a bivariate Gaussian mixture, V'(j/iC) — -^2(m, Q^^), which is 
parametrised as follows: 



Cfc = (Ml.fcj M2,/;, dl.fc, ^2,*;, fifc) , Ck ~ \ '" 1/2 



]k — CkCk ■ 
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This parametrisation is based on Bartlett decomposition: taking dj^k ~ Gamma(a/2, 
d2,k ~ Gamma((a — l)/2,/3), ek\P ~ N{0,1/P) leads to a Wishart prior for Qk, 
Qk ~ Wishart2(a, /3/2)- This parametrisation is also convenient in terms of imple- 
menting the automatically tuned random walk Hastings- Metropolis strategy discussed 
in Section [ 



To complete the specification of the prior, we assume that 



that Q = 2, and that /3 ~ Ga,mma{g, h). Of course, this prior is meant to generalise the 
prior used in the previous section in a simple way. In particular, the hyper-parameter 
/? should play the same role as in the univariate Gaussian case, and we use it as our 
reaction coordinate. 



5.3.2. Numerical results 



We consider two out of the four measurements recorded in F isher's Iris dataset, petal 
length and petal width, see e.g. iFriihwirth-Schnatterl (|2006l . Chap. 6), and Figure |5] 
for a scatter-plot. We take K = 2. As in the previous example, we run a standard 
SMC sampler (with the same number of particles, and so on), and observes that only 
one mode is recovered. We then run a free energy SMC sampler. For the sake of 
space, we report only the debiased output at the very final stage of the free energy 
SMC sampler, that is the cloud of particles targeting the true posterior distribution. 
Figure [5] represents the bivariate vectors fik , and Figure [TO] represent the component 
probabilities qk = ujk/ij-^i +1^2) for fc = 1, 2. Clearly, the output is nearly symmetric. 



One sees in this example that free energy SMC still works well for bivariate Gaus- 
sian mixture model, despite the larger dimension of the parameter space. In particular, 
the choice of the reaction coordinate seems to work along the same lines, i.e. choosing 
an hyper-parameter that determines the spread of the components. 



Figure 8: Iris sample 
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Figure 9: Hexagon binning for fik 
example 



(/^fc.i: Mfe,2), fc = 1, bivariate Gaussian 
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Figure 10: Weighted histograms of qk = ujk/{i^i +1^2), for fc = 1, 2, bivariate 
Gaussian example 

6. CONCLUSION 

In this paper, we introduced free energy SMC sampling, and observed in one mixture 
example that it may be fast er than free energy m ethods based on adaptive MCMC, 
such as those considered in IChopin et alj (|2010l ). It would be far-fetched to reach 
general conclusions from this preliminary study regarding the respective merits of free 
energy SMC versus free energy MCMC, or, worse, SMC versus Adaptive MCMC. If 
anything, the good results obtained in our examples validates, in the mixture context, 
the idea of combining two recipes to overcome multimodality, namely (a) free energy 
biasing, and (b) tracking through SMC some sequence (vrt) of increasing difficulty, 
which terminates at nr ~ Whether such combination should work or would be 
meaningful in other contexts is left for further research. 
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